Looking at global tourism trends before the international entry restrictions imposed due to the Covid-19 pandemic, tourism in Japan was on the rise, especially in Tokyo. Tokyo is the largest metropolitan area in the world and one of the most visited destinations, as it offers many unique experiences. Tokyo is a metropolitan prefecture comprising administrative entities of special wards and municipalities. Almost three-quarters of Tokyo's population lives in the eastern part of the prefecture, in what are referred to as the 23 special wards, which are considered the core and most populous part of Tokyo. Each ward has a distinct character of its own for tourists and travelers to explore.
Many tourists travel to experience different cultures, traditions, and gastronomy. It is difficult for tourists to choose among the many options for travel essentials, because everyone has their own preferences about where to go, and the information is so fragmented that one has to assemble an itinerary oneself, especially when looking for local, non-touristy recommendations.
Thus, this project leverages Foursquare location data and machine learning to build a recommendation and segmentation system. By clustering venue information by area, it aims to support personalized travel planning: providing users with a schedule-planning service and identifying what might be the 'best' areas for different activities, ranging from accommodation and attractions to restaurants, parks, and more, so that visitors can make the most of their stay in Tokyo.
An area that will be analyzed in this project: Tokyo’s special wards.
Factors that will influence the decision:
Data Sources:
Web Scraping - extract Tokyo major districts and wards table from Wikipedia.
Geospatial data of the districts, wards and attractions via Geocoders
Before we get the data and start exploring it, let's import all required libraries...
! pip install jupyter-conda
! pip3 install lxml
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation
! pip install BeautifulSoup4
from bs4 import BeautifulSoup
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
!pip install git+https://github.com/geopandas/geopandas.git
import geopandas as gpd
!pip install geoplot
import geoplot as gplt
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
# libraries for displaying images
from IPython.display import Image
from IPython.core.display import HTML
# transforming a JSON file into a pandas dataframe
from pandas import json_normalize # pandas.io.json.json_normalize is deprecated in newer pandas
!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library
# import k-means from clustering stage
from sklearn.cluster import KMeans
import seaborn as sns
from matplotlib import pyplot as plt
print('Folium installed')
print('Libraries imported.')
url = "https://en.wikipedia.org/wiki/Special_wards_of_Tokyo#List_of_special_wards"
source = requests.get(url).text
soup = BeautifulSoup(source, 'html5lib')
#find all html tables in the web page
tables = soup.find_all('table') # in html table is represented by the tag <table>
# we can see how many tables were found by checking the length of the tables list
len(tables)
#print(tables[index].prettify())
Scrape data from HTML table into a DataFrame using BeautifulSoup and read_html
tokyo_data = pd.read_html(str(tables[3]), flavor='bs4')
tokyo_data
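`pd.read_html` parses any `<table>` markup, so the same call works on a literal HTML string. A minimal sketch on a toy two-row table (not the actual Wikipedia markup):

```python
from io import StringIO
import pandas as pd

# A toy table standing in for the scraped Wikipedia markup
html = """
<table>
  <tr><th>Ward</th><th>Population</th></tr>
  <tr><td>Shibuya</td><td>230000</td></tr>
  <tr><td>Minato</td><td>260000</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found
tables = pd.read_html(StringIO(html))
wards = tables[0]
print(wards.shape)          # one DataFrame with 2 rows, 2 columns
print(list(wards.columns))
```

The `<th>` cells in the first row are automatically promoted to column headers, which is why the scraped Wikipedia table arrives with usable column names.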
#Create a dataframe with table
tokyo_wards = tokyo_data[0] #pd.read_html(str(tables[3]), flavor='bs4')[0]
tokyo_wards = tokyo_wards.rename(columns = {tokyo_wards.columns[2] : 'Ward', tokyo_wards.columns[-2] : 'Area', tokyo_wards.columns[-3] : 'Density', tokyo_wards.columns[-4] : 'Population'} )
tokyo_wards#.tail()
#Drop unused columns and the last row
tokyo_wards_data = tokyo_wards.drop(['No.', 'Flag'], axis=1)
tokyo_wards_data.drop([23], inplace=True)
tokyo_wards_data
Get the coordinates of 23 special wards using GeoCoder
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="Tokyo_explorer")
tokyo_wards_data['Major_Dist_Coord']= tokyo_wards_data['Ward'].apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))
tokyo_wards_data[['Latitude', 'Longitude']] = tokyo_wards_data['Major_Dist_Coord'].apply(pd.Series)
tokyo_wards_data.drop(['Major_Dist_Coord'], axis=1, inplace=True)
tokyo_wards_data
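One caveat with the geocoding step above: `Nominatim` can return `None` for ambiguous names (and is rate-limited), so the bare `.apply(lambda x: (x.latitude, x.longitude))` raises on any miss. A defensive variant is sketched below with a stub geocoder so it runs offline; the stub, its lookup table, and its coordinates are illustrative only, not real lookups:

```python
import pandas as pd

def safe_coords(name, geocode_fn, suffix=", Tokyo, Japan"):
    """Geocode with a disambiguating suffix; return (None, None) on a miss."""
    loc = geocode_fn(name + suffix)
    if loc is None:
        return (None, None)
    return (loc.latitude, loc.longitude)

# Stub standing in for geolocator.geocode (illustrative coordinates only)
class _Loc:
    def __init__(self, lat, lng):
        self.latitude, self.longitude = lat, lng

def fake_geocode(query):
    table = {"Shibuya, Tokyo, Japan": _Loc(35.66, 139.70)}
    return table.get(query)  # None for unknown names

wards = pd.DataFrame({"Ward": ["Shibuya", "Nowhere"]})
wards[["Latitude", "Longitude"]] = wards["Ward"].apply(
    lambda w: pd.Series(safe_coords(w, fake_geocode)))
print(wards)  # "Nowhere" gets missing coordinates instead of raising
```

Appending ", Tokyo, Japan" to each query also biases the geocoder toward the right region, which would avoid some of the misplaced districts fixed later on.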
Split the "Major districts" column into one row per district and name the new column "District"
tokyo_wards_data_dist = tokyo_wards_data.drop('Major districts', axis=1).join(tokyo_wards_data['Major districts'].str.split(',', expand=True).stack().reset_index(level=1, drop=True).rename('District'))
tokyo_wards_data_dist.head()
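The split-and-stack idiom above turns each comma-separated cell into one row per district, repeating the ward values. On a toy frame (recent pandas also offers `Series.explode` for the same job):

```python
import pandas as pd

# Toy stand-in for the wards table, with comma-separated district lists
wards = pd.DataFrame({
    "Ward": ["Shibuya", "Minato"],
    "Major districts": ["Shibuya,Harajuku", "Akasaka,Roppongi,Shinbashi"],
})

# Split each list into columns, stack into rows, and join back on the index
districts = (wards.drop(columns="Major districts")
             .join(wards["Major districts"].str.split(",", expand=True)
                   .stack().reset_index(level=1, drop=True)
                   .rename("District")))
print(districts)  # 5 rows: one per district, ward values repeated
```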
Get the coordinates of districts
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="Tokyo_explorer")
tokyo_wards_data_dist['Major_Dist_Coord_']= tokyo_wards_data_dist['District'].apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))
tokyo_wards_data_dist[['District Latitude', 'District Longitude']] = tokyo_wards_data_dist['Major_Dist_Coord_'].apply(pd.Series)
tokyo_wards_data_dist.drop(['Major_Dist_Coord_'], axis=1, inplace=True)
tokyo_wards_data_dist
#save
# Export dataframe to csv, If later we want to start with a csv copy
tokyo_wards_data_dist.to_csv('tokyo_wards_data_dist.csv',index=False)
#read
tokyo_wards_data_dist = pd.read_csv('tokyo_wards_data_dist.csv')
tokyo_wards_data_dist.head(3)
With the query, "Top attractions in tokyo", Google gives the first 4 places according to the quality rating. (https://www.google.com/travel/things-to-do?g2lb=2502548%2C2503771%2C2503780%2C4258168%2C4270442%2C4306835%2C4317915%2C4328159%2C4371334%2C4401769%2C4419364%2C4463666%2C4482438%2C4486153%2C4491350%2C4492925%2C4517257%2C4523593%2C4524133%2C4526388%2C4270859%2C4284970%2C4291517&hl=th-US&gl=us&ssta=1&dest_mid=%2Fg%2F12lnhn10f&dest_state_type=main&dest_src=ts&sa=X&ved=2ahUKEwjxiqPQqM_vAhUpIjQIHfIGAw4QuL0BMAJ6BAgIEDg#ttdm=35.640149_139.792202_11&ttdmf=%2525252Fm%2525252F03k987)
For simplicity, create a small data file with the first four places from Google:
data = {'Attraction': ['Sensō-ji Temple', 'Tokyo Skytree', 'Tokyo Tower', 'Meiji Shrine'],
'Address': ['2-3-1 Asakusa, Taitō-ku, Tokyo', '1 Chome-1-2 Oshiage, Sumida City, Tokyo','4 Chome-2-8 Shibakoen, Minato City, Tokyo', '1-1 Yoyogikamizonocho, Shibuya City, Tokyo'],
'Ward': ['Taitō', 'Sumida','Minato', 'Shibuya'],
'District': ['Asakusa', 'Oshiage','Shibakoen','Shibuya' ]}
# df = pd.DataFrame (data, columns = ['Attraction','Address','District','Latitude of Attraction', 'Longitude of Attraction','Ward'])
df = pd.DataFrame (data, columns = ['Attraction','Address','District','Ward'])
#Get lat and lng of the each attraction
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="Tokyo_explorer")
df['Major_Dist_Coord']= df['Attraction'].apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))
df[['Attraction Latitude', 'Attraction Longitude']] = df['Major_Dist_Coord'].apply(pd.Series)
df.drop(['Major_Dist_Coord'], axis=1, inplace=True)
df
#save
# Export dataframe to csv, If later we want to start with a csv copy for task 2
df.to_csv('tokyo_attractions.csv',index=False)
df = pd.read_csv('tokyo_attractions.csv')
df
Firstly, get the geographical coordinates of Tokyo...
address = 'Tokyo'
geolocator = Nominatim(user_agent="Tokyo_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Tokyo are {}, {}.'.format(latitude, longitude))
#Check data types:
df.dtypes
tokyo_wards_data_dist.dtypes
Download the Tokyo wards GeoJSON file
!wget --quiet https://raw.githubusercontent.com/dataofjapan/land/master/tokyo.geojson
print('GeoJSON file downloaded!')
tokyo_geo = r'tokyo.geojson' # geojson file
The ward names in the geojson file end with "Ku", while those in the Wikipedia table do not. So, to merge the geojson dataframe with the Wikipedia table, I will merge on the "Kanji" and "ward_ja" columns instead.
from shapely.geometry import shape
import json
gdf = gpd.read_file(tokyo_geo)
gdf = gdf.merge(tokyo_wards_data_dist, left_on="ward_ja", right_on="Kanji")
gdf = gdf.drop(columns=['ward_ja','ward_en','Kanji','area_ja','area_en','code','Population','Area','Latitude','Longitude'])
gdf
gdf.to_csv('gdf.csv',index=False)
gdf = pd.read_csv('gdf.csv')
Visualize the population density of Tokyo's special wards
# Initialize the figure
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 1, figsize=(16, 12))
# Set up the color scheme:
import mapclassify as mc
scheme = mc.Quantiles(gdf['Density'], k=8)
# Map
gplt.choropleth(gdf,
hue="Density",
linewidth=.1,
scheme=scheme, cmap='Dark2',
legend=True,
edgecolor='black',
ax=ax
);
ax.set_title('Population Density (/km^2) in Each Ward of Tokyo', fontsize=13);
Now let's create a map of the major districts in the 23 special wards, using the latitude and longitude values to check that the locations are correct...
map_jp = folium.Map(location=[latitude, longitude], zoom_start=11)
# add markers to map
for lat, lng, label in zip(gdf['District Latitude'], gdf['District Longitude'], gdf['District']):
label = folium.Popup(label, parse_html=True)
folium.CircleMarker(
[lat, lng],
radius=5,
popup=label,
color='blue',
fill=True,
fill_color='#3186cc',
fill_opacity=0.7,
parse_html=False).add_to(map_jp)
map_jp
As seen, there are several misplaced geolocations that need to be fixed.
Before fixing the misplaced locations, we first merge "gdf" with "df" to select only the districts in the wards where the top 4 attractions are located.
df_tokyo = gdf.merge(df, on="Ward")
df_tokyo = df_tokyo.drop(columns=['Address','geometry','District_y', 'Attraction', 'Attraction Latitude','Attraction Longitude' ])
df_tokyo = df_tokyo.rename(columns={"District_x": "District"})
df_tokyo.reset_index()
df_tokyo
map_jp_err = folium.Map(location=[latitude, longitude], zoom_start=11)
# add markers to map
for lat, lng, label in zip(df_tokyo['District Latitude'], df_tokyo['District Longitude'], df_tokyo['District']):
label = folium.Popup(label, parse_html=True)
folium.CircleMarker(
[lat, lng],
radius=5,
popup=label,
color='blue',
fill=True,
fill_color='#3186cc',
fill_opacity=1,
parse_html=False).add_to(map_jp_err)
map_jp_err
map_jp_err.save('visualize Err Loc Map.html')
According to the map above, the coordinate values of several districts are off. This kind of error is easy to make, since many places around the world share the same name.
Let's fix...
Districts: Aoyama, Hiroo, Mita
According to https://latitude.to/articles-by-country/jp/japan/34507/aoyama-minato-tokyo:
Aoyama's correct location: 35.6720, 139.7230
Hiroo's correct location: 35.6505, 139.7173
Mita's correct location: 35.6472, 139.7409
#Aoyama
df_tokyo.loc[df_tokyo['District'] == 'Aoyama']
#The district isn't found using the above query...
#...because the comma split left a leading space in the name
df_tokyo.loc[df_tokyo['District'] == ' Aoyama']
df_tokyo.iloc[16]
#Replace with the correct lat and lng values
df_tokyo.at[16,['District Latitude', 'District Longitude']]= [35.6720,139.7230]
#check
df_tokyo.iloc[16]
#Hiroo
df_tokyo.iloc[6]
#Replace with the correct lat and lng values
df_tokyo.at[6,['District Latitude', 'District Longitude']]= [35.6505,139.7173]
#check
df_tokyo.iloc[6]
#Mita
df_tokyo.loc[df_tokyo['District'] == ' Mita']
df_tokyo.at[13,['District Latitude', 'District Longitude']]= [35.6472,139.7409]
df_tokyo.iloc[13]
df_tokyo.dtypes # confirm the corrected coordinates are still floats
#save
# Export dataframe to csv, If later we want to start with a csv copy for task 2
df_tokyo.to_csv('df_tokyo.csv',index=False)
df_tokyo = pd.read_csv('df_tokyo.csv')
df_tokyo
#Recheck
# create map of districts in Tokyo using latitude and longitude values, to check if they are correct
map_jp_err_recheck = folium.Map(location=[latitude, longitude], zoom_start=11)
# add markers to map
for lat, lng, label in zip(df_tokyo['District Latitude'], df_tokyo['District Longitude'], df_tokyo['District']):
label = folium.Popup(label, parse_html=True)
folium.CircleMarker(
[lat, lng],
radius=5,
popup=label,
color='#b300fa',
fill=True,
fill_color='#b300fa',
fill_opacity=0.7,
parse_html=False).add_to(map_jp_err_recheck)
map_jp_err_recheck
map_jp_err_recheck.save('visualize Recheck Map.html')
Now I have all the wards and districts of Tokyo, along with the top 4 attractions. Next, I will use the Foursquare API to fetch the venues surrounding those locations, in order to explore the districts and wards of the attractions.
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
ACCESS_TOKEN = 'YOUR_FOURSQUARE_ACCESS_TOKEN'
VERSION = '20180604'
LIMIT = 100 # limit of number of venues returned by Foursquare API
# radius = 1500 # define radius
print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)
#Create a function to get the top 100 venues in each district
#Venue Recommendations
def getNearbyVenues(names, latitudes, longitudes, radius):
venues_list=[]
for name, lat, lng in zip(names, latitudes, longitudes):
print(name)
# create the API request URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
CLIENT_ID,
CLIENT_SECRET,
VERSION,
lat,
lng,
radius,
LIMIT)
# make the GET request
results = requests.get(url).json()["response"]['groups'][0]['items']
# return only relevant information for each nearby venue
venues_list.append([(
name,
lat,
lng,
v['venue']['name'],
v['venue']['location']['lat'],
v['venue']['location']['lng'],
# v['venue']['location']['address'],
v['venue']['location']['distance'],
v['venue']['categories'][0]['name'],
v['venue']['id']) for v in results])
nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
nearby_venues.columns = ['District',
'District Latitude',
'District Longitude',
'Venue',
'Venue Latitude',
'Venue Longitude',
# 'Address',
'Venue distance',
'Venue Category',
'Venue ID']
return(nearby_venues)
#-------------------------------------------------------------------------------------------------
#Venue Details
def get_venue_details(venue_id):
#url to fetch data from foursquare api
url = 'https://api.foursquare.com/v2/venues/{}?&client_id={}&client_secret={}&v={}'.format(
venue_id,
CLIENT_ID,
CLIENT_SECRET,
VERSION)
# get all the data
results = requests.get(url).json()
print(results)
venue_data=results['response']['venue']
venue_details=[]
try:
venue_id=venue_data['id']
venue_name=venue_data['name']
venue_likes=venue_data['likes']['count']
venue_rating=venue_data['rating']
venue_tips=venue_data['tips']['count']
venue_details.append([venue_id,venue_name,venue_likes,venue_rating,venue_tips])
except KeyError:
pass
column_names=['ID','Name','Likes','Rating','Tips']
df = pd.DataFrame(venue_details,columns=column_names)
return df
Utilizing the Foursquare API (venue recommendations) to explore each district's top 100 venues within a radius of 2,500 m, and creating a new dataframe, "Tokyo_venues".
Tokyo_venues = getNearbyVenues(names=df_tokyo['District'],
latitudes=df_tokyo['District Latitude'],
longitudes=df_tokyo['District Longitude'],
radius=2500
)
Tokyo_venues_csv = Tokyo_venues.merge(gdf[['District','Ward']], on='District')
Tokyo_venues_csv
Tokyo_venues_csv.to_csv('Tokyo_venues_csv.csv',index=False)
Tokyo_venues_csv = pd.read_csv('Tokyo_venues_csv.csv')
Although the Foursquare API helped to find a lot of venues, it can return duplicate venues when two centroids are too close together. So, let's check whether there are any duplicated venues.
Count how many duplicates there are...
Tokyo_venues_csv.duplicated('Venue ID').value_counts()
Drop duplicates by keeping the first one only
Tokyo_venues_csv = Tokyo_venues_csv.sort_values(['Venue ID','Venue distance'] , ascending=[False, True])
Tokyo_venues_csv = Tokyo_venues_csv.drop_duplicates(subset='Venue ID', keep='first')
Tokyo_venues_csv
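The sort-then-drop pattern above keeps, for each `Venue ID`, the row with the smallest `Venue distance` (i.e. the closest centroid). A sketch on toy data:

```python
import pandas as pd

# Toy frame: venue "a" was returned by two overlapping district centroids
venues = pd.DataFrame({
    "Venue ID":       ["a", "a", "b"],
    "Venue distance": [900, 300, 150],
    "District":       ["Asakusa", "Oshiage", "Shibuya"],
})

# Sort so the closest duplicate comes first, then keep that first row
deduped = (venues.sort_values(["Venue ID", "Venue distance"],
                              ascending=[False, True])
           .drop_duplicates(subset="Venue ID", keep="first"))
print(deduped)  # venue "a" survives only with distance 300
```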
#Recheck
Tokyo_venues_csv.duplicated('Venue ID').value_counts()
#save the change
Tokyo_venues_csv.to_csv('Tokyo_venues_csv.csv',index=False)
Tokyo_venues_csv = pd.read_csv('Tokyo_venues_csv.csv')
Let's see how many districts are in these 4 wards
tot_distr = Tokyo_venues_csv['District'].unique()
print (tot_distr)
num_distc = tot_distr.shape
num_distc
#Find how many unique venue categories there are in the districts
tot_unique = Tokyo_venues_csv['Venue Category'].unique()
tot_unique.shape
So, there are 19 districts in 4 wards, with 181 unique venue categories.
Let's visualize the venues in the districts...
# create map
map_num_distc = folium.Map(location=[latitude, longitude], zoom_start=11)
# add markers to map
for lat, lng, poi, label in zip(Tokyo_venues_csv['Venue Latitude'], Tokyo_venues_csv['Venue Longitude'], Tokyo_venues_csv['Ward'], Tokyo_venues_csv['Venue']):
label = folium.Popup(label, parse_html=True)
folium.CircleMarker(
[lat, lng],
radius=5,
popup=label,
color='blue',
fill=True,
fill_color='#3186cc',
fill_opacity=1,
parse_html=True).add_to(map_num_distc)
map_num_distc
map_num_distc.save('map_num_distc.html')
Let's assign numbers to the wards for color clustering...
Tokyo_venues_csv['No.Ward'] = Tokyo_venues_csv['Ward'].astype('category').cat.codes
Tokyo_venues_csv.tail(20)
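`cat.codes` assigns each ward a stable integer based on the alphabetical order of the category levels, which is what makes it usable as a color index. A quick sketch:

```python
import pandas as pd

# Toy ward column; categories are sorted alphabetically before coding
wards = pd.Series(["Taitō", "Sumida", "Minato", "Shibuya", "Minato"])
codes = wards.astype("category").cat.codes
print(dict(zip(wards, codes)))  # Minato→0, Shibuya→1, Sumida→2, Taitō→3
```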
Tokyo_venues_csv.to_csv('Tokyo_venues_csv.csv',index=False)
Tokyo_venues_csv = pd.read_csv('Tokyo_venues_csv.csv')
#Clustered venues by ward
map_num_distc2 = folium.Map(location=[latitude, longitude], zoom_start=11)
# set color scheme for the clusters
x = np.arange(4)
ys = [i + x + (i*x)**2 for i in range(4)]
colors_array = cm.spring(np.linspace(0, 1, len(ys)))
spring = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
markers_colors = []
for lat, lng, poi, cluster in zip(Tokyo_venues_csv['Venue Latitude'], Tokyo_venues_csv['Venue Longitude'], Tokyo_venues_csv['Venue'], Tokyo_venues_csv['No.Ward']):
label = folium.Popup(str(poi), parse_html=True)
folium.CircleMarker(
[lat, lng],
radius=5,
popup=label,
color=spring[cluster-1],
fill=True,
fill_color=spring[cluster-1],
fill_opacity=0.5).add_to(map_num_distc2)
#Display the attractions
# add markers to map
for lat, lng, label in zip(df['Attraction Latitude'], df['Attraction Longitude'], df['Attraction']):
label = folium.Popup(label, parse_html=True)
folium.Marker(
[lat, lng],
popup=label,
icon = folium.Icon(color='cadetblue',icon = 'heart',prefix='glyphicon')
).add_to(map_num_distc2)
map_num_distc2
map_num_distc2.save('map_num_distc_colored.html')
First, let's extract venue IDs
s = Tokyo_venues_csv[Tokyo_venues_csv['Venue'].str.contains("Senso-ji")]
s
Sensoji = s.iloc[0][8]
Sensoji
t = Tokyo_venues_csv[Tokyo_venues_csv['Venue'].str.contains("Tokyo Skytree")]
t
The Foursquare API gives the correct coordinates for Tokyo Skytree (東京スカイツリー) but the wrong district: 35°42'36.2"N 139°48'38.6"E is actually "1 Chome-1-83 Oshiage, Sumida City, Tokyo 131-0045, Japan", according to Google Maps.
TokyoSkytree = t.iloc[1][8] #second row, 'Venue ID' column
TokyoSkytree
tt = Tokyo_venues_csv[Tokyo_venues_csv['Venue'].str.contains("Tower")] #search in df
TokyoTower = tt.iloc[0][8] #extract
TokyoTower
MJ = Tokyo_venues_csv[Tokyo_venues_csv['Venue'].str.contains("Meiji|Shrine")]
Meiji = MJ.iloc[0][8]
Meiji
Get ratings and likes from Foursquare using the Get Venue Details endpoint
venue_ids = [Sensoji, TokyoSkytree, TokyoTower, Meiji]
venue_ids_df = pd.DataFrame(venue_ids, columns = ['Venue ID'])
venue_ids_df
# venue_ids_df = venue_ids_df.merge(Tokyo_venues_csv[['Venue ID', 'Venue']], on='Venue ID')
result_m = pd.merge(venue_ids_df, Tokyo_venues_csv, on=["Venue ID"])
result_m
#Ratings & likes of venues
rating_df = []
for k in range(result_m.shape[0]):
venue_id = result_m['Venue ID'][k]
url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}'.format(venue_id, CLIENT_ID, CLIENT_SECRET, VERSION)
result = requests.get(url).json()['response']
# rating = result["response"]["venue"]["rating"]
# likes = result["response"]["venue"]["likes"]['count']
# rating_df.append(rating)
# rating_df.append(likes)
rating_df.append([(result['venue']['name'],
result['venue']['rating'],
result['venue']['likes'].get('count'))])
rate_df = pd.DataFrame([item for rating_df in rating_df for item in rating_df])
rate_df.columns = ['Venue','Rating','Likes']
rate_df
Now let's go back to the venue dataframe to find the most common venue categories in each area, by doing a quick check to see how many venues have been returned for each district.
The size of the resulting dataframe:
Tokyo_venues_Category = Tokyo_venues_csv.groupby(Tokyo_venues_csv['District']).count()
Tokyo_venues_Category = Tokyo_venues_Category.drop(columns=['Ward', 'No.Ward'])
Tokyo_venues_Category
print('There are {} uniques categories in 4 wards/19 districts.'.format(len(Tokyo_venues_csv['Venue Category'].unique())))
# create a dataframe of top 15 frequent categories
Tokyo_Venues_Top15 = Tokyo_venues_csv['Venue Category'].value_counts()[0:15].to_frame(name='frequency')
Tokyo_Venues_Top15=Tokyo_Venues_Top15.reset_index()
Tokyo_Venues_Top15.rename(index=str, columns={"index": "Venue_Category", "frequency": "Frequency"}, inplace=True)
Tokyo_Venues_Top15
import seaborn as sns
from matplotlib import pyplot as plt
s=sns.barplot(x="Venue_Category", y="Frequency", data= Tokyo_Venues_Top15)
s.set_xticklabels(s.get_xticklabels(), rotation=45, horizontalalignment='right')
plt.title("15 Most Frequent Venues in the Wards" , fontsize=15)
plt.xlabel("Venue Category", fontsize=5)
plt.ylabel ("Frequency", fontsize=8)
plt.savefig("Most_Freq_Venues1.png", dpi=300, bbox_inches = "tight")
fig = plt.figure(figsize=(18,7))
plt.tight_layout()
plt.show()
How many unique venue categories in each district...
no_venues_each = Tokyo_venues_csv.groupby('District')['Venue Category'].nunique()
no_venues_eachh= pd.DataFrame(no_venues_each)
no_venues_eachh = no_venues_eachh.rename(columns = {"Venue Category" : "NoofCategory"}).reset_index()
list_ven_no =no_venues_eachh['NoofCategory'].to_list()
list_dist =no_venues_eachh['District'].to_list()
no_venues_eachh
palett = sns.color_palette("viridis",18)
ss=sns.barplot(x="NoofCategory", y="District", data= no_venues_eachh,palette= palett)
# ss.set_xticklabels(ss.get_xticklabels())
plt.title('Number of Venue Categories in Each District', fontsize=15)
plt.xlabel("Number of Venue Categories")
plt.ylabel ("District")
plt.savefig("Unique_venues_each_district.png", dpi=300, bbox_inches = "tight")
fig = plt.figure(figsize=(18,7))
plt.show()
How many unique venue categories in each ward...
#How many unique venue categories...in the 4 wards
no_venues_each_w = Tokyo_venues_csv.groupby('Ward')['Venue Category'].nunique()
no_venues_eachh_w= pd.DataFrame(no_venues_each_w)
no_venues_eachh_w = no_venues_eachh_w.rename(columns = {"Venue Category" : "NoofCategory"}).reset_index()
list_ven_no_w =no_venues_eachh_w['NoofCategory'].to_list()
list_ward =no_venues_eachh_w['Ward'].to_list()
no_venues_eachh_w
no_venues_eachh_w.info()
palett = sns.color_palette("rocket_r")
ss=sns.barplot(x="NoofCategory", y="Ward", data= no_venues_eachh_w,palette= "coolwarm")
# ss.set_xticklabels(ss.get_xticklabels())
plt.title('Number of Venue Categories in Each Ward', fontsize=15)
plt.xlabel("Number of Venue Categories")
plt.ylabel ("Ward")
plt.savefig("Unique_venues_each_ward.png", dpi=300, bbox_inches = "tight")
fig = plt.figure(figsize=(18,7))
plt.show()
See which districts are in the ward that has most unique venues...
numdist_Minato = Tokyo_venues_csv["District"][Tokyo_venues_csv["Ward"]=='Minato'].unique()
numdist_Minato.tolist()
To do prescriptive analytics that help a tourist decide where to go, I will use K-means clustering, an unsupervised machine learning algorithm that groups data points into clusters based on similarity.
One hot encoding:
#get_dummies is a way to create dummy variables for categorical features.
# act like 'switches' that turn various parameters on and off in an equation
tokyo_onehot = pd.get_dummies(Tokyo_venues_csv[['Venue Category']], prefix="", prefix_sep="")
# add district column back to dataframe
tokyo_onehot['District'] = Tokyo_venues_csv['District']
# move district column to the first column
fixed_columns = [tokyo_onehot.columns[-1]] + list(tokyo_onehot.columns[:-1])
tokyo_onehot = tokyo_onehot[fixed_columns]
tokyo_onehot[100:300]
tokyo_onehot.shape
Next, let's group the rows by district, taking the mean of the frequency of occurrence of each category
Tokyo_grouped = tokyo_onehot.groupby('District').mean().reset_index()
Tokyo_grouped
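On a toy frame, `get_dummies` followed by `groupby(...).mean()` yields, per district, the fraction of its venues falling in each category — the feature vectors the clustering will run on:

```python
import pandas as pd

venues = pd.DataFrame({
    "District":       ["Asakusa", "Asakusa", "Shibuya", "Shibuya"],
    "Venue Category": ["Temple", "Café", "Café", "Café"],
})

# One column per category; empty prefix keeps the category name as-is
onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="")
onehot["District"] = venues["District"]

# Mean of the 0/1 indicators = per-district category frequency
freq = onehot.groupby("District").mean().reset_index()
print(freq)  # Asakusa: Café 0.5, Temple 0.5; Shibuya: Café 1.0, Temple 0.0
```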
Let's print each district along with the top 15 most common venues.
num_top_venues = 15
for hood in Tokyo_grouped['District']:
print("----"+hood+"----")
temp = Tokyo_grouped[Tokyo_grouped['District'] == hood].T.reset_index()
temp.columns = ['venue','freq']
temp = temp.iloc[1:]
temp['freq'] = temp['freq'].astype(float)
temp = temp.round({'freq': 2})
print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
print('\n')
Now let's create a dataframe and display the top 10 venues for each district.
def return_most_common_venues(row, num_top_venues):
row_categories = row.iloc[1:]
row_categories_sorted = row_categories.sort_values(ascending=False)
return row_categories_sorted.index.values[0:num_top_venues]
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['District']
for ind in np.arange(num_top_venues):
try:
columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
except:
columns.append('{}th Most Common Venue'.format(ind+1))
# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['District'] = Tokyo_grouped['District']
for ind in np.arange(Tokyo_grouped.shape[0]):
neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Tokyo_grouped.iloc[ind, :], num_top_venues)
neighborhoods_venues_sorted
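The helper simply sorts one row's category frequencies in descending order and returns the top labels; restated here on a toy row so the sketch is self-contained:

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]  # drop the leading 'District' cell
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Toy frequency row: District label first, then category frequencies
row = pd.Series({"District": "Asakusa", "Temple": 0.5,
                 "Café": 0.3, "Hotel": 0.2})
top2 = return_most_common_venues(row, 2)
print(top2)  # the two highest-frequency categories
```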
After using one hot encoding and taking the mean of the frequency for each venue category, let's use K-means clustering to create K clusters of data points based on similarities.
First, to implement this algorithm, it is very important to determine the optimal number of clusters (aka k).
Elbow method:
Tokyo_grouped_clustering = Tokyo_grouped.drop('District', axis=1)
# determine k using elbow method
from sklearn.cluster import KMeans
from sklearn import metrics
from scipy.spatial.distance import cdist
import numpy as np
import matplotlib.pyplot as plt
# k means determine k
distortions = []
K = range(1,11)
for k in K:
kmeanModel = KMeans(n_clusters=k).fit(Tokyo_grouped_clustering)
kmeanModel.fit(Tokyo_grouped_clustering)
distortions.append(sum(np.min(cdist(Tokyo_grouped_clustering, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / Tokyo_grouped_clustering.shape[0])
# Plot the elbow
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method Showing the Optimal k')
plt.show()
Sometimes the elbow method does not give a clear answer, as happened in this case.
Let's try a different method of finding the best value for k.
Silhouette_score method:
from sklearn.metrics import silhouette_score
sil = []
k_sil = range(2,11)
#a minimum of 2 clusters is required to define dissimilarity
for k in k_sil:
print(k, end=" ")
kmeans = KMeans(n_clusters = k).fit(Tokyo_grouped_clustering)
labels = kmeans.labels_
sil.append(silhouette_score(Tokyo_grouped_clustering,labels, metric = 'euclidean'))
plt.plot(k_sil, sil, 'bo-')
plt.xlabel('k')
plt.ylabel('silhouette_score')
plt.title('Silhouette Method Showing the Optimal k')
plt.show()
There is a peak at k = 3. However, three clusters would group the districts very broadly.
Therefore, in this case, the number of clusters k is chosen to be 4.
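The silhouette sweep itself can be sanity-checked offline on synthetic data, where the true number of clusters is known (the blob parameters below are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated blobs: the silhouette sweep should peak at k = 3
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5,
                  random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```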
Run k-means to cluster the districts into 4 clusters.
from sklearn.cluster import KMeans
# set number of clusters
kclusters = 4
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Tokyo_grouped_clustering)
# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]
k_means_labels = kmeans.labels_
k_means_labels
k_means_cluster_centers = kmeans.cluster_centers_
Create a new dataframe that includes the cluster column as well as the top 10 venues for each district.
# Add clustering labels
# neighborhoods_venues_sorted.drop(columns=['Cluster Labels'], inplace=True) #for re-run
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
# tokyo_mergedd = Tokyo_venues_csv
# merge with df_tokyo to add latitude/longitude for each district
# to obtain the final result
tokyo_mergedd = df_tokyo.join(neighborhoods_venues_sorted.set_index('District'), on='District')
tokyo_mergedd['Cluster Labels'] = tokyo_mergedd['Cluster Labels'].fillna(0)
tokyo_mergedd['Cluster Labels'] = tokyo_mergedd['Cluster Labels'].astype(int)
tokyo_mergedd = tokyo_mergedd.drop(columns=[ 'Density'])
tokyo_mergedd.head() # check the last columns!
tokyo_mergedd.to_csv('tokyo_mergedd_with_labels.csv',index=False)
tokyo_mergedd = pd.read_csv('tokyo_mergedd_with_labels.csv')
#Find null or NaN rows
tokyo_mergedd[tokyo_mergedd.isna().any(axis=1)]
Finally, let's visualize the resulting clusters
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(tokyo_mergedd['District Latitude'], tokyo_mergedd['District Longitude'], tokyo_mergedd['District'], tokyo_mergedd['Cluster Labels']):
label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
folium.CircleMarker(
[lat, lon],
radius=10,
popup=label,
color=rainbow[cluster-1],
fill=True,
fill_color=rainbow[cluster-1],
fill_opacity=1).add_to(map_clusters)
#Display the attractions
# add markers to map
for lat, lng, label in zip(df['Attraction Latitude'], df['Attraction Longitude'], df['Attraction']):
label = folium.Popup(label, parse_html=True)
folium.Marker(
[lat, lng],
popup=label,
icon = folium.Icon(color='cadetblue',icon = 'heart',prefix='glyphicon')
).add_to(map_clusters)
map_clusters
map_clusters.save('visualize clusters.html')
Let's examine each cluster and determine the discriminating venue categories that distinguish each cluster.
tokyo_mergedd.loc[tokyo_mergedd['Cluster Labels'] == 0, tokyo_mergedd.columns[[0] + list(range(1, tokyo_mergedd.shape[1]))]]
tokyo_mergedd.loc[tokyo_mergedd['Cluster Labels'] == 0]['1st Most Common Venue'].value_counts()
tokyo_mergedd.loc[tokyo_mergedd['Cluster Labels'] == 1, tokyo_mergedd.columns[[0] + list(range(1, tokyo_mergedd.shape[1]))]]
tokyo_mergedd.loc[tokyo_mergedd['Cluster Labels'] == 1]['1st Most Common Venue'].value_counts()
Cluster3 = tokyo_mergedd.loc[tokyo_mergedd['Cluster Labels'] == 2, tokyo_mergedd.columns[[0] + list(range(1, tokyo_mergedd.shape[1]))]]
Cluster3
tokyo_mergedd.loc[tokyo_mergedd['Cluster Labels'] == 2]['1st Most Common Venue'].value_counts()
tokyo_mergedd.loc[tokyo_mergedd['Cluster Labels'] == 3, tokyo_mergedd.columns[[0] + list(range(1, tokyo_mergedd.shape[1]))]]
tokyo_mergedd.loc[tokyo_mergedd['Cluster Labels'] == 3]['1st Most Common Venue'].value_counts()
# one hot encoding
tokyo_onehot = pd.get_dummies(Tokyo_venues_csv[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
tokyo_onehot['Ward'] = Tokyo_venues_csv['Ward']
# move neighborhood column to the first column
fixed_columns = [tokyo_onehot.columns[-1]] + list(tokyo_onehot.columns[:-1])
tokyo_onehot = tokyo_onehot[fixed_columns]
tokyo_onehot[100:300]
Tokyo_grouped2 = tokyo_onehot.groupby('Ward').mean().reset_index()
Tokyo_grouped2
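The one-hot-encode-then-average pattern above turns raw venue rows into per-ward category frequencies. A toy illustration of the same steps (hypothetical wards and categories, not the real Foursquare data):

```python
import pandas as pd

# Hypothetical venue rows: two wards, three venue categories
toy = pd.DataFrame({
    'Ward': ['A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Cafe', 'Cafe', 'Park', 'Cafe', 'Museum'],
})

# One-hot encode the categories, then average per ward:
# each cell becomes the fraction of that ward's venues in that category
onehot = pd.get_dummies(toy[['Venue Category']], prefix="", prefix_sep="")
onehot['Ward'] = toy['Ward']
freq = onehot.groupby('Ward').mean().reset_index()
print(freq)
# Ward A: Cafe 2/3, Museum 0, Park 1/3; Ward B: Cafe 1/2, Museum 1/2, Park 0
```

Because the counts are normalized by `.mean()`, wards with different numbers of returned venues remain comparable.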
num_top_venues = 25
for hood in Tokyo_grouped2['Ward']:
print("----"+hood+"----")
temp = Tokyo_grouped2[Tokyo_grouped2['Ward'] == hood].T.reset_index()
temp.columns = ['venue','freq']
temp = temp.iloc[1:]
temp['freq'] = temp['freq'].astype(float)
temp = temp.round({'freq': 2})
print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
print('\n')
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['Ward']
for ind in np.arange(num_top_venues):
try:
columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
except IndexError:  # beyond 'rd', fall back to 'th'
columns.append('{}th Most Common Venue'.format(ind+1))
# create a new dataframe
neighborhoods_venues_sorted2 = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted2['Ward'] = Tokyo_grouped2['Ward']
for ind in np.arange(Tokyo_grouped2.shape[0]):
neighborhoods_venues_sorted2.iloc[ind, 1:] = return_most_common_venues(Tokyo_grouped2.iloc[ind, :], num_top_venues)
neighborhoods_venues_sorted2
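The helper `return_most_common_venues` is defined earlier in the notebook; for reference, a minimal sketch of what it is assumed to do (take one ward's frequency row, sort the categories descending, and return the top category names):

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # Skip the leading 'Ward' label, sort category frequencies
    # in descending order, and return the top category names
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Hypothetical frequency row: ward name followed by category frequencies
row = pd.Series({'Ward': 'A', 'Cafe': 0.5, 'Park': 0.3, 'Museum': 0.2})
print(return_most_common_venues(row, 2))  # ['Cafe' 'Park']
```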
With only 4 wards, each ward could simply be treated as its own cluster without K-Means, since the dataframe above already condenses the relevant information. Still, applying K-Means lets us confirm that grouping.
from sklearn.cluster import KMeans
# set number of clusters
kclusters = 4
Tokyo_grouped2 = Tokyo_grouped2.drop('Ward', axis=1)  # positional axis argument was removed in pandas 2.0
# run k-means clustering
kmeans2 = KMeans(n_clusters=kclusters, random_state=0).fit(Tokyo_grouped2)
# check cluster labels generated for each row in the dataframe
kmeans2.labels_[0:10]
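With only four wards, k=4 trivially assigns one ward per cluster. On larger inputs, such as the district-level data above, a silhouette check is one way to sanity-check the choice of k. A sketch on synthetic data (not the Tokyo data itself):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic 2-D data: three well-separated blobs of 30 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(30, 2))
               for c in ((0, 0), (5, 5), (0, 5))])

# Score candidate k values; a higher silhouette means
# tighter, better-separated clusters
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
# k=3 should score highest for three blobs
```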
# Add clustering labels
# neighborhoods_venues_sorted2.drop(columns=['Cluster Labels'], inplace=True) #for re-run
neighborhoods_venues_sorted2.insert(0, 'Cluster Labels', kmeans2.labels_)
tokyo_mergedd2 = neighborhoods_venues_sorted2
# tokyo_mergedd2 = df_tokyo.join(neighborhoods_venues_sorted2.set_index('Ward'), on='Ward')
tokyo_mergedd2['Cluster Labels'] = tokyo_mergedd2['Cluster Labels'].fillna(0)
tokyo_mergedd2['Cluster Labels'] = tokyo_mergedd2['Cluster Labels'].astype(int)
tokyo_mergedd2 # check the last columns!
tokyo_mergedd2.loc[tokyo_mergedd2['Cluster Labels'] == 0]
tokyo_mergedd2.loc[tokyo_mergedd2['Cluster Labels'] == 1]
tokyo_mergedd2.loc[tokyo_mergedd2['Cluster Labels'] == 2]
tokyo_mergedd2.loc[tokyo_mergedd2['Cluster Labels'] == 3]
Through this analysis, we identified 4 major tourism wards and clustered a total of 184 venue categories across their 19 major districts into 4 clusters. The clusters can be generalized as follows:
Cluster 1 — 7 districts; the most common venues are restaurants, BBQ joints, and bars.
Cluster 2 — 6 districts; the most common venues are restaurants, coffee shops, and parks.
Cluster 3 — 5 districts; the most common venues are restaurants, museums, and hotels.
Cluster 4 — 1 district; the most common venues are restaurants.
If we instead cluster the venues by ward, the clusters can be generalized as follows:
Cluster 1 — Minato; the most common venue is the hotel.
Cluster 2 — Taitō; the most common venue is the ramen restaurant.
Cluster 3 — Shibuya; the most common venue is the coffee shop.
Cluster 4 — Sumida; the most common venue is the BBQ joint.
The results show that the majority of venues in Tokyo's popular areas are eateries. However, the model also surfaces other common activities a visitor could take part in while in Tokyo.
Conclusion
We explored how machine learning can help the tourism industry, how the modern-day tourism industry uses machine learning, and how to build a simple recommendation and suggestion system. The tourism industry is in shambles at the moment due to the COVID-19 pandemic, but once the situation improves, tourism will most likely rebound to its previous highs. Machine learning will be right there, helping companies stay afloat and generate growth.